{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Standardize and deduplicate fiction metadata\n",
    "\n",
    "This notebook begins after masterficmeta has been created.\n",
    "\n",
    "The next step is to produce a dataset where there's a single copy of each physical volume. In terms of [FRBR,](https://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records) this is roughly the \"manifestation\" level.\n",
    "\n",
    "In other words, after this round of dedup, there will still be several copies of *Middlemarch* in the dataset: maybe a three-volume edition from the 1870s, as well as one-volume editions in 1892 and 1950, adding up to five volumes. But there should be only one copy of each volume, in each edition.\n",
    "\n",
    "I'll achieve that very simply by relying on HTRC record IDs + volume numbers.\n",
    "\n",
    "This notebook also prepares for the next stage of deduplication by loosely standardizing author names.\n",
    "\n",
    "The output is written to disk as manifestationmeta.tsv.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some generic imports.\n",
    "\n",
    "import csv\n",
    "import pandas as pd\n",
    "import unicodedata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### General strategy for standardizing authors\n",
    "\n",
    "We're not going to use fuzzy matching on this pass; we're going to make very conservative assumptions and link authors only if\n",
    "\n",
    "    1) They have the same last name\n",
    "    2) Also the same initials\n",
    "    3) And also *either* the same birthdate, or\n",
    "    4) Are listed as authoring the same title.\n",
    "\n",
    "Very short author names will also be ignored.\n",
    "\n",
    "To do this we need dictionaries of authors' birthdates and titles-authored. While we're at it we can group authors into blocks that share the same first two letters, to speed up the comparison process later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read in the existing metadata, and normalize unicode\n",
    "\n",
    "def normalize_unicode(astring):\n",
    "    if pd.isnull(astring):\n",
    "        return ''\n",
    "    else:\n",
    "        astring = astring.replace('  ', ' ')\n",
    "        astring = unicodedata.normalize('NFC', astring)\n",
    "        return astring\n",
    "\n",
    "meta = pd.read_csv('../masterficmetadata.tsv', sep = '\\t', low_memory = False)\n",
    "meta = meta.assign(newauthor = meta.author.map(normalize_unicode))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = meta.groupby('newauthor')\n",
    "titlesets = dict()\n",
    "birthdates = dict()\n",
    "authorblocks = dict()\n",
    "\n",
    "for author, group in grouped:\n",
    "    birthset = set()\n",
    "    titleset = set()\n",
    "    for ix in group.index:\n",
    "        authordate = group.loc[ix, 'authordate']\n",
    "        if not pd.isnull(authordate) and len(authordate) >= 4:\n",
    "            try:\n",
    "                birth = int(authordate[0:4])\n",
    "                birthset.add(birth)\n",
    "            except:\n",
    "                pass\n",
    "        title = group.loc[ix, 'shorttitle']\n",
    "        if not pd.isnull(title):\n",
    "            if len(title) > 20:\n",
    "                title = title[0:20]\n",
    "            titleset.add(title)\n",
    "    if len(birthset) == 1:\n",
    "        birthdates[author] = birthset.pop()\n",
    "    titlesets[author] = titleset\n",
    "    \n",
    "    if len(author) > 2:\n",
    "        initial = author[0:2].lower()\n",
    "    else:\n",
    "        initial = 'xx'\n",
    "    if initial not in authorblocks:\n",
    "        authorblocks[initial] = set()\n",
    "    authorblocks[initial].add(author)\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### actually generate standardized names\n",
    "\n",
    "We go through each block, looking for pairs of authors who meet the above criteria.\n",
    "\n",
    "When a matched pair are found, we add them as a new group, unless one member of the pair is in an existing group, in which case, they are both added to that group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups = []\n",
    "\n",
    "def zapnsplit(a):\n",
    "    a = a.replace('from old catalog', ' ')\n",
    "    a = a.replace('.', ' ')\n",
    "    a = a.replace('(', ' ')\n",
    "    a = a.replace(')', ' ')\n",
    "    a = a.replace('[', ' ')\n",
    "    a = a.replace(']', ' ')\n",
    "    a = a.replace(',', ' ')\n",
    "    a = a.replace('Mrs', ' ')\n",
    "    a = a.replace('Sir', ' ')\n",
    "    # we don't want honorifics to interfere with \n",
    "    # identity\n",
    "    \n",
    "    parts = a.lower().split()\n",
    "    lastname = parts[0]\n",
    "    parts = [x for x in parts if len(x) > 0]\n",
    "    return set(parts), lastname\n",
    "\n",
    "def matchnames(namea, nameb):\n",
    "    a, lasta = zapnsplit(namea)\n",
    "    b, lastb = zapnsplit(nameb)\n",
    "    \n",
    "    if lasta != lastb:\n",
    "        return False\n",
    "    \n",
    "    missing = a - b\n",
    "    for m in missing:\n",
    "        if m[0] not in b:\n",
    "            return False\n",
    "    # for each word that's missing in b,\n",
    "    # we check to see if the initial is present in B\n",
    "    \n",
    "    missing = b - a\n",
    "    for m in missing:\n",
    "        if m[0] not in a:\n",
    "            return False\n",
    "    # likewise for a\n",
    "    \n",
    "    # If all tests are passed,  \n",
    "    return True\n",
    "\n",
    "# iterate through author blocks and actually do the work\n",
    "\n",
    "for initial, block in authorblocks.items():\n",
    "        \n",
    "    for a1 in block:\n",
    "        for a2 in block:\n",
    "            if a1 == a2:\n",
    "                continue\n",
    "            if len(a1) < 9 or len(a2) < 9:\n",
    "                continue\n",
    "            if a1[0:5] != a2[0:5]:\n",
    "                continue\n",
    "            \n",
    "            if len(a1) > 21:\n",
    "                trunca1 = a1[0:21]\n",
    "            else:\n",
    "                trunca1 = a1\n",
    "                \n",
    "            if len(a2) > 21:\n",
    "                trunca2 = a2[0:21]\n",
    "            else:\n",
    "                trunca2 = a2\n",
    "                \n",
    "            if trunca1 != trunca2 and not matchnames(a1, a2):\n",
    "                continue\n",
    "            \n",
    "            titlematch = 0\n",
    "            \n",
    "            birthmatch = False\n",
    "            if (a1 in birthdates and a2 in birthdates) and (birthdates[a1] == birthdates[a2]):\n",
    "                birthmatch = True\n",
    "                titlematch += 1\n",
    "                # not requiring titlematch if birthdates match\n",
    "                \n",
    "            elif a1 not in birthdates:\n",
    "                birthmatch = True\n",
    "            elif a2 not in birthdates:\n",
    "                birthmatch = True\n",
    "            \n",
    "            if not birthmatch:\n",
    "                continue\n",
    "            \n",
    "            for t1 in titlesets[a1]:\n",
    "                for t2 in titlesets[a2]:\n",
    "                    if t1 == t2:\n",
    "                        titlematch += 1\n",
    "            \n",
    "            if titlematch == 0:\n",
    "                continue\n",
    "            else:\n",
    "                found = False\n",
    "                for g in groups:\n",
    "                    if a1 in g or a2 in g:\n",
    "                        g.add(a1)\n",
    "                        g.add(a2)\n",
    "                        found = True\n",
    "                        break\n",
    "                        \n",
    "                if not found:\n",
    "                    groups.append({a1, a2})\n",
    "                    \n",
    "                "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1148"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(groups)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write the author groups to file for future reference.\n",
    "\n",
    "with open('newauthorgroups.tsv', mode = 'w', encoding = 'utf-8') as f:\n",
    "    for g in groups:\n",
    "        of_record = ''\n",
    "        for n in g:\n",
    "            official = n.replace('from old catalog', '')\n",
    "            official = official.strip('[], .')\n",
    "            if official[-1].isupper():\n",
    "                official = official + '.'\n",
    "            if len(of_record) < 1 or len(official) > len(of_record):\n",
    "                of_record = official\n",
    "        outline = of_record + '\\t' + '\\t'.join(g) + '\\n'\n",
    "        f.write(outline)\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's read that back in, to create a translation dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "author_trans = dict()\n",
    "\n",
    "with open('newauthorgroups.tsv', encoding = 'utf-8') as f:\n",
    "    for line in f:\n",
    "        fields = line.strip('\\n').split('\\t')\n",
    "        official = fields [0]\n",
    "        for name in fields:\n",
    "            author_trans[name] = official"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Actually standardize the author names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def map_names(aname):\n",
    "    global author_trans\n",
    "    if aname in author_trans:\n",
    "        return author_trans[aname]\n",
    "    else:\n",
    "        return aname\n",
    "\n",
    "meta = meta.assign(newauthor = meta.newauthor.map(map_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Let's also standardize authordates\n",
    "\n",
    "Some rows have birth and/or death dates for an author; others for the same author may lack that info. We're going to want to take the richest available info, and spread it across all rows for the author. \"Longest\" is not the world's best metric of \"richest,\" but in practice, for this case, it will do."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "authordates = dict()\n",
    "grouped = meta.groupby('newauthor')\n",
    "for author, group in grouped:\n",
    "    longest = ''\n",
    "    for d in group.authordate:\n",
    "        d = str(d)\n",
    "        if pd.isnull(d):\n",
    "            continue\n",
    "        if len(d) > len(longest) and d.lower() != 'nan':\n",
    "            longest = d\n",
    "    authordates[author] = longest\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The actual deduplication\n",
    "\n",
    "We create a column that is str(recordid) + str(volnum), so we can group on the combination. (Passing both to groupby gives you a multi-index, which is more complexity than I need today.)\n",
    "\n",
    "We also record the number of instances of a recordid-volnum combination that are being collapsed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def concatenate_fields(row):\n",
    "    ''' Concatenate two fields, with special provision\n",
    "    for the likelihood that the second is null.\n",
    "    '''\n",
    "    fielda = row['recordid']\n",
    "    fieldb = row['volnum']\n",
    "    \n",
    "    if pd.isnull(fieldb):\n",
    "        fieldb = 'nan'\n",
    "    else:\n",
    "        fieldb = str(fieldb)\n",
    "    \n",
    "    return str(fielda) + '+' + fieldb\n",
    "\n",
    "meta = meta.assign(groupingcolumn = meta.apply(concatenate_fields, axis = 1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "10001\n",
      "20001\n",
      "30001\n",
      "40001\n",
      "50001\n",
      "60001\n",
      "70001\n",
      "80001\n",
      "90001\n",
      "100001\n",
      "110001\n",
      "120001\n",
      "130001\n",
      "140001\n",
      "150001\n",
      "160001\n",
      "170001\n"
     ]
    }
   ],
   "source": [
    "kept = []\n",
    "ctr = 0\n",
    "instances = dict()\n",
    "grouped = meta.groupby(['groupingcolumn'])\n",
    "for key, group in grouped:\n",
    "    ctr += 1\n",
    "    if ctr % 10000 == 1:\n",
    "        print(ctr)\n",
    "    keeper = ''\n",
    "    lowest = 2100\n",
    "    for idx in group.index:\n",
    "        date = int(group.loc[idx, 'inferreddate'])\n",
    "        if (date < lowest and date > 1699) or lowest == 2100:\n",
    "            lowest = date\n",
    "            \n",
    "            if lowest < 1700:\n",
    "                lowest = 2100\n",
    "            # dubious \"dates\" should not outcompete real dates\n",
    "            \n",
    "            keeper = group.loc[idx, 'docid']\n",
    "            if type(keeper) != str:\n",
    "                keeper = keeper[0]\n",
    "    numcopies = len(group.index)\n",
    "    instances[keeper] = numcopies\n",
    "    kept.append(keeper)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That created a list of docids to keep. Now we just have to keep them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(210278, 26)\n",
      "(176623, 25)\n",
      "(176623, 25)\n"
     ]
    }
   ],
   "source": [
    "# actually do the deduplication\n",
    "print(meta.shape)\n",
    "meta.set_index('docid', inplace = True)\n",
    "deduped = meta.loc[kept, : ]\n",
    "print(deduped.shape)\n",
    "deduped = deduped[~deduped.index.duplicated(keep='first')]\n",
    "print(deduped.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In winnowing duplicates, we don't want to lose authors' birth and death dates. Let's insure we have the richest information.\n",
    "\n",
    "Also, while we're at it, let's add a column recording the number of instances for each record-vol."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def map_instances(docid):\n",
    "    global instances\n",
    "    if docid in instances:\n",
    "        return instances[docid]\n",
    "    else:\n",
    "        return 0\n",
    "\n",
    "deduped['instances'] = deduped.index.map(map_instances)\n",
    "\n",
    "def enrich_authordates(row):\n",
    "    ''' Enriches the authordate with a longer form where appropriate\n",
    "    '''\n",
    "    global authordates\n",
    "    \n",
    "    authdate = row['authordate']\n",
    "    if pd.isnull(authdate):\n",
    "        authdate = ''\n",
    "    author  = row['newauthor']\n",
    "    if author in authordates and len(authordates[author]) > len(authdate) and authordates[author] != 'nan':\n",
    "        return authordates[author]\n",
    "    else:\n",
    "        return authdate\n",
    "    \n",
    "deduped = deduped.assign(authordate = deduped.apply(enrich_authordates, axis = 1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>oldauthor</th>\n",
       "      <th>author</th>\n",
       "      <th>authordate</th>\n",
       "      <th>inferreddate</th>\n",
       "      <th>latestcomp</th>\n",
       "      <th>datetype</th>\n",
       "      <th>startdate</th>\n",
       "      <th>enddate</th>\n",
       "      <th>imprint</th>\n",
       "      <th>imprintdate</th>\n",
       "      <th>...</th>\n",
       "      <th>place</th>\n",
       "      <th>recordid</th>\n",
       "      <th>enumcron</th>\n",
       "      <th>volnum</th>\n",
       "      <th>title</th>\n",
       "      <th>parttitle</th>\n",
       "      <th>shorttitle</th>\n",
       "      <th>newauthor</th>\n",
       "      <th>groupingcolumn</th>\n",
       "      <th>instances</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>docid</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>uc2.ark:/13960/t4mk67z1j</th>\n",
       "      <td>Jones, J. D. (John Daniel)</td>\n",
       "      <td>Jones, J. D. (John Daniel)</td>\n",
       "      <td>1865-1942.</td>\n",
       "      <td>1900</td>\n",
       "      <td>1900</td>\n",
       "      <td></td>\n",
       "      <td>1900</td>\n",
       "      <td>1989</td>\n",
       "      <td>New York|George H. Doran|19--</td>\n",
       "      <td>&lt;estimate=\"[19--]\"&gt;</td>\n",
       "      <td>...</td>\n",
       "      <td>nyu</td>\n",
       "      <td>100000247</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Lord of life and death / | $c: by J.D. Jones.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The Lord of life and death</td>\n",
       "      <td>Jones, J. D. (John Daniel)</td>\n",
       "      <td>100000247+nan</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc2.ark:/13960/t39z92s0m</th>\n",
       "      <td>Ovcharenko, Ivan Vasilʹevich</td>\n",
       "      <td>Ovcharenko, Ivan Vasilʹevich</td>\n",
       "      <td></td>\n",
       "      <td>1900</td>\n",
       "      <td>1900</td>\n",
       "      <td></td>\n",
       "      <td>1900</td>\n",
       "      <td>1985</td>\n",
       "      <td>London|Modern Books|19--?</td>\n",
       "      <td>&lt;estimate=\"[19--?]\"&gt;</td>\n",
       "      <td>...</td>\n",
       "      <td>enk</td>\n",
       "      <td>100000271</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>In a ring of fire; | memories of a partisan.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>In a ring of fire; memories of a partisan</td>\n",
       "      <td>Ovcharenko, Ivan Vasilʹevich</td>\n",
       "      <td>100000271+nan</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc2.ark:/13960/t0ht2j50b</th>\n",
       "      <td>Phillpotts, Eden</td>\n",
       "      <td>Phillpotts, Eden</td>\n",
       "      <td>1862-1960.</td>\n",
       "      <td>1900</td>\n",
       "      <td>1900</td>\n",
       "      <td></td>\n",
       "      <td>1900</td>\n",
       "      <td>1987</td>\n",
       "      <td>Paris|Thomas Nelson and Sons|19--?</td>\n",
       "      <td>&lt;estimate=\"[19--?]\"&gt;</td>\n",
       "      <td>...</td>\n",
       "      <td>fr</td>\n",
       "      <td>100000276</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Old Delabole / | $c: by Eden Phllpotts.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Old Delabole</td>\n",
       "      <td>Phillpotts, Eden</td>\n",
       "      <td>100000276+nan</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc2.ark:/13960/t1sf2p90r</th>\n",
       "      <td>Rosborough, Alexander J</td>\n",
       "      <td>Rosborough, Alexander J</td>\n",
       "      <td></td>\n",
       "      <td>1900</td>\n",
       "      <td>1900</td>\n",
       "      <td></td>\n",
       "      <td>1900</td>\n",
       "      <td>1986</td>\n",
       "      <td>Yreka, CA|News-Journal Print Shop|19]̲̲</td>\n",
       "      <td>&lt;estimate=\"19]̲̲\"&gt;</td>\n",
       "      <td>...</td>\n",
       "      <td>cau</td>\n",
       "      <td>100000283</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The mountie and the sourdough / | $c: by Alexa...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The mountie and the sourdough</td>\n",
       "      <td>Rosborough, Alexander J</td>\n",
       "      <td>100000283+nan</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>uc2.ark:/13960/t9c53hj4v</th>\n",
       "      <td>Verne, Jules</td>\n",
       "      <td>Verne, Jules</td>\n",
       "      <td>1828-1905.</td>\n",
       "      <td>1900</td>\n",
       "      <td>1900</td>\n",
       "      <td></td>\n",
       "      <td>1900</td>\n",
       "      <td>1986</td>\n",
       "      <td>New York|Phoenix Publishing Co.|19--?</td>\n",
       "      <td>&lt;estimate=\"[19--?]\"&gt;</td>\n",
       "      <td>...</td>\n",
       "      <td>nyu</td>\n",
       "      <td>100000299</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The tour of the world in eighty days / | $c: b...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The tour of the world in eighty days</td>\n",
       "      <td>Verne, Jules</td>\n",
       "      <td>100000299+nan</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             oldauthor  \\\n",
       "docid                                                    \n",
       "uc2.ark:/13960/t4mk67z1j    Jones, J. D. (John Daniel)   \n",
       "uc2.ark:/13960/t39z92s0m  Ovcharenko, Ivan Vasilʹevich   \n",
       "uc2.ark:/13960/t0ht2j50b              Phillpotts, Eden   \n",
       "uc2.ark:/13960/t1sf2p90r       Rosborough, Alexander J   \n",
       "uc2.ark:/13960/t9c53hj4v                  Verne, Jules   \n",
       "\n",
       "                                                author  authordate  \\\n",
       "docid                                                                \n",
       "uc2.ark:/13960/t4mk67z1j    Jones, J. D. (John Daniel)  1865-1942.   \n",
       "uc2.ark:/13960/t39z92s0m  Ovcharenko, Ivan Vasilʹevich               \n",
       "uc2.ark:/13960/t0ht2j50b              Phillpotts, Eden  1862-1960.   \n",
       "uc2.ark:/13960/t1sf2p90r       Rosborough, Alexander J               \n",
       "uc2.ark:/13960/t9c53hj4v                  Verne, Jules  1828-1905.   \n",
       "\n",
       "                          inferreddate  latestcomp datetype startdate enddate  \\\n",
       "docid                                                                           \n",
       "uc2.ark:/13960/t4mk67z1j          1900        1900               1900    1989   \n",
       "uc2.ark:/13960/t39z92s0m          1900        1900               1900    1985   \n",
       "uc2.ark:/13960/t0ht2j50b          1900        1900               1900    1987   \n",
       "uc2.ark:/13960/t1sf2p90r          1900        1900               1900    1986   \n",
       "uc2.ark:/13960/t9c53hj4v          1900        1900               1900    1986   \n",
       "\n",
       "                                                          imprint  \\\n",
       "docid                                                               \n",
       "uc2.ark:/13960/t4mk67z1j            New York|George H. Doran|19--   \n",
       "uc2.ark:/13960/t39z92s0m                London|Modern Books|19--?   \n",
       "uc2.ark:/13960/t0ht2j50b       Paris|Thomas Nelson and Sons|19--?   \n",
       "uc2.ark:/13960/t1sf2p90r  Yreka, CA|News-Journal Print Shop|19]̲̲   \n",
       "uc2.ark:/13960/t9c53hj4v    New York|Phoenix Publishing Co.|19--?   \n",
       "\n",
       "                                   imprintdate    ...     place   recordid  \\\n",
       "docid                                             ...                        \n",
       "uc2.ark:/13960/t4mk67z1j   <estimate=\"[19--]\">    ...       nyu  100000247   \n",
       "uc2.ark:/13960/t39z92s0m  <estimate=\"[19--?]\">    ...       enk  100000271   \n",
       "uc2.ark:/13960/t0ht2j50b  <estimate=\"[19--?]\">    ...       fr   100000276   \n",
       "uc2.ark:/13960/t1sf2p90r    <estimate=\"19]̲̲\">    ...       cau  100000283   \n",
       "uc2.ark:/13960/t9c53hj4v  <estimate=\"[19--?]\">    ...       nyu  100000299   \n",
       "\n",
       "                         enumcron volnum  \\\n",
       "docid                                      \n",
       "uc2.ark:/13960/t4mk67z1j      NaN    NaN   \n",
       "uc2.ark:/13960/t39z92s0m      NaN    NaN   \n",
       "uc2.ark:/13960/t0ht2j50b      NaN    NaN   \n",
       "uc2.ark:/13960/t1sf2p90r      NaN    NaN   \n",
       "uc2.ark:/13960/t9c53hj4v      NaN    NaN   \n",
       "\n",
       "                                                                      title  \\\n",
       "docid                                                                         \n",
       "uc2.ark:/13960/t4mk67z1j  The Lord of life and death / | $c: by J.D. Jones.   \n",
       "uc2.ark:/13960/t39z92s0m       In a ring of fire; | memories of a partisan.   \n",
       "uc2.ark:/13960/t0ht2j50b            Old Delabole / | $c: by Eden Phllpotts.   \n",
       "uc2.ark:/13960/t1sf2p90r  The mountie and the sourdough / | $c: by Alexa...   \n",
       "uc2.ark:/13960/t9c53hj4v  The tour of the world in eighty days / | $c: b...   \n",
       "\n",
       "                         parttitle                                 shorttitle  \\\n",
       "docid                                                                           \n",
       "uc2.ark:/13960/t4mk67z1j       NaN                 The Lord of life and death   \n",
       "uc2.ark:/13960/t39z92s0m       NaN  In a ring of fire; memories of a partisan   \n",
       "uc2.ark:/13960/t0ht2j50b       NaN                               Old Delabole   \n",
       "uc2.ark:/13960/t1sf2p90r       NaN              The mountie and the sourdough   \n",
       "uc2.ark:/13960/t9c53hj4v       NaN       The tour of the world in eighty days   \n",
       "\n",
       "                                             newauthor groupingcolumn  \\\n",
       "docid                                                                   \n",
       "uc2.ark:/13960/t4mk67z1j    Jones, J. D. (John Daniel)  100000247+nan   \n",
       "uc2.ark:/13960/t39z92s0m  Ovcharenko, Ivan Vasilʹevich  100000271+nan   \n",
       "uc2.ark:/13960/t0ht2j50b              Phillpotts, Eden  100000276+nan   \n",
       "uc2.ark:/13960/t1sf2p90r       Rosborough, Alexander J  100000283+nan   \n",
       "uc2.ark:/13960/t9c53hj4v                  Verne, Jules  100000299+nan   \n",
       "\n",
       "                          instances  \n",
       "docid                                \n",
       "uc2.ark:/13960/t4mk67z1j          1  \n",
       "uc2.ark:/13960/t39z92s0m          1  \n",
       "uc2.ark:/13960/t0ht2j50b          1  \n",
       "uc2.ark:/13960/t1sf2p90r          1  \n",
       "uc2.ark:/13960/t9c53hj4v          1  \n",
       "\n",
       "[5 rows x 26 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "deduped.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "manual = dict()\n",
    "with open('manual_author_matches.tsv', encoding = 'utf-8') as f:\n",
    "    reader = csv.DictReader(f, delimiter = '\\t')\n",
    "    for row in reader:\n",
    "        manual[row['alias']] = unicodedata.normalize('NFC', row['realname'])\n",
    "\n",
    "def manual_correction(author):\n",
    "    global manual\n",
    "    if author in manual:\n",
    "        return manual[author]\n",
    "    else:\n",
    "        return author"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we just write to file, after reassigning a column, dropping an extra column, and sorting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "deduped = deduped.assign(author = deduped.newauthor)\n",
    "deduped = deduped.assign(author = deduped.author.map(manual_correction))\n",
    "deduped.drop(labels = ['newauthor', 'groupingcolumn'], axis = 1, inplace = True)\n",
    "deduped.sort_values(by = ['inferreddate', 'recordid', 'volnum'], inplace = True)\n",
    "deduped.to_csv('newmanifestationmeta.tsv', sep = '\\t')\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
